191 research outputs found

    Linguistically informed and corpus informed morphological analysis of Arabic

    Standard English PoS-taggers generally involve tag-assignment (via dictionary lookup etc.) followed by tag-disambiguation (via a context model, e.g. PoS-ngrams or Brill transformations). We want to PoS-tag our Arabic Corpus, but evaluation of existing PoS-taggers has highlighted shortcomings; in particular, about a quarter of all word tokens are not assigned a fully correct morphological analysis. Tag-assignment is significantly more complex for Arabic. An Arabic lemmatiser program can extract the stem or root, but this is not enough for full PoS-tagging; words should be decomposed into five parts: proclitics, prefixes, stem or root, suffixes and postclitics. The morphological analyser should then add the appropriate linguistic information to each of these parts of the word; in effect, instead of a tag for a word, we need a subtag for each part (and possibly multiple subtags if there are multiple proclitics, prefixes, suffixes or postclitics). Implementing Arabic morphology faces many challenges: the rich “root-and-pattern” nonconcatenative (or nonlinear) morphology and the highly complex word-formation process of roots and patterns, especially when one or two long vowels are among the root letters. Further challenges arise from the orthographic issues of Arabic, such as short vowels ( َ ُ ِ ), Hamzah (ء أ إ ؤ ئ), Taa’ Marboutah ( ة ) versus Ha’ ( ه ), Ya’ ( ي ) versus Alif Maksorah ( ى ), Shaddah ( ّ ) or gemination, and Maddah ( آ ) or extension, a compound letter of Hamzah and Alif ( أا ). Our morphological analyzer uses linguistic knowledge of the language as well as corpora to verify the linguistic information. To understand the problem, we started by analyzing fifteen established Arabic language dictionaries, to build a broad-coverage lexicon which contains not only roots and single words but also multi-word expressions, idioms, collocations requiring special part-of-speech assignment, and words with special part-of-speech tags.
The next stage of research was a detailed analysis and classification of Arabic language roots to address the “tail” of hard cases for existing morphological analyzers, along with analysis of the roots, word-root combinations and the coverage of each root category of the Qur’an and the word-root information stored in our lexicon. From authoritative Arabic grammar books, we extracted and generated comprehensive lists of affixes, clitics and patterns. These lists were then cross-checked by analyzing words of three corpora: the Qur’an, the Corpus of Contemporary Arabic and the Penn Arabic Treebank (as well as our Lexicon, treated as a fourth cross-check corpus). We also developed a novel algorithm that generates the correct pattern of each word and handles the orthographic issues of the Arabic language and other word-derivation issues, such as the elimination or substitution of root letters.
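The five-part decomposition described above can be sketched as a small data structure. The affix lists, subtags and greedy segmentation rule below are illustrative assumptions, not the authors' lexicon-derived analyzer:

```python
# Sketch of the five-part word decomposition: proclitics, prefixes,
# stem/root, suffixes, postclitics, each paired with a subtag.
# The affix lists and subtags are illustrative placeholders only.
PROCLITICS = {"و": "CONJ+", "ب": "PREP+", "ال": "DET+"}
SUFFIXES = {"ون": "+NSUFF_MASC_PL", "ات": "+NSUFF_FEM_PL"}

def decompose(word):
    """Greedy longest-match segmentation; a real analyzer would
    validate the remaining stem against a root/pattern lexicon."""
    parts = {"proclitics": [], "prefixes": [], "stem": word,
             "suffixes": [], "postclitics": []}
    changed = True
    while changed:
        changed = False
        for pro, tag in sorted(PROCLITICS.items(), key=lambda p: -len(p[0])):
            if parts["stem"].startswith(pro) and len(parts["stem"]) > len(pro) + 1:
                parts["proclitics"].append((pro, tag))
                parts["stem"] = parts["stem"][len(pro):]
                changed = True
    for suf, tag in sorted(SUFFIXES.items(), key=lambda p: -len(p[0])):
        if parts["stem"].endswith(suf) and len(parts["stem"]) > len(suf) + 1:
            parts["suffixes"].append((suf, tag))
            parts["stem"] = parts["stem"][:-len(suf)]
            break
    return parts
```

For example, decomposing والمعلمون ("and the teachers") strips the conjunction و and determiner ال as proclitics and the plural suffix ون, leaving the stem معلم.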

    Development of tag sets for part-of-speech tagging

    This article discusses tag sets used when PoS-tagging a corpus, that is, enriching a corpus by adding a part-of-speech tag to each word. This requires a tag-set, a list of grammatical category labels; a tagging scheme, practical definitions of each tag or label, showing words and contexts where each tag applies; and a tagger, a program for assigning a tag to each word in the corpus, implementing the tag-set and tagging-scheme in a tag-assignment algorithm. We start by reviewing tag-sets developed for English corpora in section 1, since English was the first language studied by corpus linguists. Pioneering corpus linguists thought that their English corpora could be more useful research resources if each word was annotated with a Part-of-Speech label or tag. Traditional English grammars generally provide 8 basic parts of speech, derived from Latin grammar. However, most tag-set developers wanted to capture finer grammatical distinctions, leading to larger tag-sets. PoS-tagged English corpora have been used in a wide range of applications. Section 2 examines criteria used in development of English corpus Part-of-Speech tag sets: mnemonic tag names; underlying linguistic theory; classification by form or function; analysis of idiosyncratic words; categorization problems; tokenisation issues: defining what counts as a word; multi-word lexical items; target user and/or application; availability and/or adaptability of tagger software; adherence to standards; variations in genre, register, or type of language; and degree of delicacy of the tag-set. To illustrate these issues, section 3 outlines a range of examples of tag set developments for different languages, and discusses how these criteria apply. 
First we consider tag-sets for an online Part-of-Speech tagging service for English; then the design of a tag-set for Urdu, another language from the same broad Indo-European language family; then for Arabic, a non-Indo-European language with a highly inflexional grammar; then for a contrasting non-Indo-European language with an isolating grammar, Malay. Finally, we present some conclusions in section 4, and references in section 5.
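The three components named above (tag-set, tagging scheme, tagger) can be illustrated with a toy sketch; the tags, lexicon and default-tag heuristic are our own minimal assumptions, not any of the tag-sets the article reviews:

```python
# Toy tag-set: a list of grammatical category labels with glosses.
TAGSET = {"NN": "noun", "VB": "verb", "DT": "determiner", "JJ": "adjective"}

# Tiny lookup lexicon standing in for dictionary-based tag-assignment;
# a real tagging scheme would define each tag with words and contexts.
LEXICON = {"the": "DT", "cat": "NN", "sat": "VB", "black": "JJ"}

def tag(tokens, default="NN"):
    """Assign a tag from TAGSET to each token; unknown words fall back
    to a default tag (a common baseline, with no context-based
    disambiguation step)."""
    return [(t, LEXICON.get(t.lower(), default)) for t in tokens]
```

Even this baseline shows why tag-set design matters: every distinction a developer wants to capture (e.g. splitting NN into singular and plural nouns) multiplies both the labels and the scheme definitions needed.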

    Current European research in computational linguistics


    Clustering of word types and unification of word tokens into grammatical word-classes

    This paper discusses Neopsy: unsupervised inference of grammatical word-classes in Natural Language. Grammatical inference can be divided into inference of grammatical word-classes and inference of structure. We review the background of supervised learning of Part-of-Speech tagging, and discuss the adaptation of the three main types of Part-of-Speech tagger to unsupervised inference of grammatical word-classes. Statistical N-gram taggers suggest a statistical clustering approach, but clustering does not help with low-frequency word-types, or with the many word-types which can appear in more than one grammatical category. The alternative Transformation-Based Learning tagger suggests a constraint-based approach of unification of word-token contexts. This offers a way to group together low-frequency word-types, and allows different tokens of one word-type to belong to different categories according to the grammatical contexts they appear in. However, simple unification of word-token-contexts yields an implausibly large number of Part-of-Speech categories; we have attempted to merge more categories by "relaxing" matching contexts to allow unification of word-categories as well as word-tokens, but this results in spurious unifications. We conclude that the way ahead may be a hybrid involving clustering of frequent word-types, unification of word-token-contexts, and "seeding" with limited linguistic knowledge. We call for a programme of further research to develop a Language Discovery Toolkit.
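The unification-of-contexts idea can be shown with a minimal sketch (the toy corpus and the simple (previous word, next word) grouping rule are our assumptions, not the Neopsy implementation): tokens sharing a context are unified into one candidate word-class.

```python
from collections import defaultdict

def context_classes(tokens):
    """Group word tokens by their (previous, next) context. Tokens
    sharing a context form one candidate word-class; different tokens
    of the same word-type may land in different classes."""
    classes = defaultdict(set)
    padded = ["<s>"] + list(tokens) + ["</s>"]
    for i in range(1, len(padded) - 1):
        ctx = (padded[i - 1], padded[i + 1])
        classes[ctx].add(padded[i])
    return dict(classes)
```

On the toy input "the cat sat . the dog sat ." the context ("the", "sat") unifies "cat" and "dog" into one class; it also shows the over-fragmentation problem the abstract describes, since every other context yields its own singleton class.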

    Design and Implementing Of Multilingual Hadith Corpus

    In this paper, we present the first design of a Multilingual Hadith Corpus. The original language of the Hadith is Arabic, and we selected English, French and Russian as additional languages for Hadith translation. The design of the Hadith corpus proceeds in four steps: the first step is data collection, drawing on the internet as the largest available source of text; the second step is cleaning the data; the third is file generation; and the last step is file annotation using XML.
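The final annotation step can be sketched with Python's standard XML library; the element and attribute names below are illustrative assumptions, not the corpus's actual annotation scheme:

```python
import xml.etree.ElementTree as ET

def annotate_hadith(hadith_id, texts):
    """Build an XML record for one Hadith with parallel translations
    keyed by language code (step four above). Tag names are
    illustrative placeholders."""
    root = ET.Element("hadith", id=str(hadith_id))
    for lang, text in texts.items():
        t = ET.SubElement(root, "text", lang=lang)
        t.text = text
    return ET.tostring(root, encoding="unicode")
```

Keeping each translation as a `text` element with a language attribute makes it straightforward to align the Arabic original with its English, French and Russian renderings in one record.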

    Towards a Computational Lexicon for Arabic Formulaic Sequences


    Using Arabic Numbers (Singular, Dual, and Plurals) Patterns To Enhance Question Answering System Results

    In the field of information retrieval, it is very difficult to answer a question entered by the user directly: a search engine retrieves a ranked list of documents containing any of the keywords or phrases, so extra effort is needed to search for the answer inside those documents, and there may be no answer at all. The alternative to a search engine is a question answering system, which retrieves the exact answer to a natural-language question, if one is found. A question answering system accepts a question in natural language and applies several processing steps to extract the exact answer. In general, a question answering system is composed of three main components: a question classification module, an information retrieval module and an answer extraction module. We applied a question answering system to the holy Quran, which is written and cited in Arabic, and used characteristics of the Arabic language to enhance answer extraction; one important characteristic is grammatical number: singular, dual and plural. A prototype was built using special patterns to process number in Arabic, which enhances the answers by covering additional word forms and meanings. A corpus of questions and their answers from the holy Quran was used to test the system.
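The singular/dual/plural patterns mentioned above can be sketched as a simple suffix classifier. This is a deliberate simplification we assume for illustration: it covers only dual and "sound" plural suffixes, while broken plurals and case-driven ambiguity need the fuller pattern processing the paper describes.

```python
# Simplified suffix-based number classifier for Arabic nouns.
# Broken plurals and undiacritized ambiguity are out of scope here;
# ين marks both the dual (acc./gen.) and the masc. sound plural.
DUAL_SUFFIXES = ("ان",)
PLURAL_SUFFIXES = ("ون", "ات")
AMBIGUOUS_SUFFIXES = ("ين",)

def grammatical_number(word):
    if word.endswith(DUAL_SUFFIXES):
        return "dual"
    if word.endswith(PLURAL_SUFFIXES):
        return "plural"
    if word.endswith(AMBIGUOUS_SUFFIXES):
        return "dual-or-plural"
    return "singular"
```

A question answering system can use such a classifier to expand a query term into its dual and plural forms, so an answer phrased with a different number form is still matched.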

    Computer-aided error annotation: a new tool for annotating Arabic error

    Existing tools for annotating errors in learner corpora were developed for languages other than Arabic. This poster therefore introduces a new tool for computer-aided error annotation in Arabic learner corpora

    Design Requirements for Multilingual Hadith Corpus

    This paper discusses the requirements of users who wish to become familiar with the Hadith. The first set of questionnaire items gathered personal information. The remaining eight questions concerned the Hadith: they began by identifying why users read the Hadith, then asked for opinions on finding Hadith explanations and word meanings in Arabic; finding a moral for each Hadith; having the Hadith explanation, word meanings and moral on one website; finding the Hadith in different languages; and the source and classification of each Hadith, respectively. Finally, users were asked to indicate their search preferences: whether they prefer to read the Hadith from websites or from books

    Developing an Ontology of Concepts in the Qur'an

    In recent years, there has been growing interest in IT for Islamic Knowledge. Researchers in religious studies have started to use ontologies to improve knowledge construction and extraction from religious texts such as the Qur’an and Hadith. An ontology can be used to describe a logical domain theory with very expressive, complex, and meaningful information. Recent research has addressed Arabic language ontology and holy Qur'an ontology, but these remain incomplete, and the processes used to extract and construct an ontology also need further work. This paper describes our ongoing work in developing an ontology. Our approach is to investigate the applicability of ontology methods of formal Knowledge Representation from Artificial Intelligence and Text Analytics research to capture and represent abstract concepts in the Qur'an. We will implement three ontology methods. The first is to elicit or extract the abstract concepts from experts in the domain. The second is to semi-automatically extract concepts from text sources in the domain. The third is to find existing partial ontologies for the domain, and to unify and re-use them. Finally, to evaluate our general Qur'an ontology, we will investigate practical use of the ontology in a semantic search application. We have implemented the third approach, merging existing ontologies. Experimental verification shows that our proposed merging methods work well, as checked by a human expert. A similarity measure was applied for ontology merging, and we report high accuracy and recall in our experiments
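The similarity-based merging of existing ontologies can be sketched as follows; the flat concept lists, the token-overlap (Jaccard) measure and the 0.5 threshold are simplified stand-ins we assume for illustration, not the paper's actual method.

```python
def jaccard(a, b):
    """Token-overlap similarity between two concept labels."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def merge_concepts(onto_a, onto_b, threshold=0.5):
    """Merge two flat concept lists: each concept from onto_b whose
    best label similarity to onto_a meets the threshold is unified
    with (mapped onto) the matching onto_a concept; the rest are
    appended as new concepts."""
    merged = list(onto_a)
    mapping = {}
    for c in onto_b:
        best = max(onto_a, key=lambda x: jaccard(c, x), default=None)
        if best is not None and jaccard(c, best) >= threshold:
            mapping[c] = best
        else:
            merged.append(c)
    return merged, mapping
```

A human expert can then review the `mapping` of unified concepts, which corresponds to the expert check of merge quality mentioned above.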